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Abstract 

Tarang is a general-purpose pseudospectral parallel code for simulating flows 
Y^ • involving fluids, magnctohydrodynamics, and Rayleigh-Benard convection in 

. — , turbulence and instability regimes. In this paper we present code validation 

and benchmarking results of Tarang. We performed our simulations on 1024 3 , 
(— ! , 2048 3 , and 4096 3 grids using the HPC system of IIT Kanpur and Shaheen of 

O i 1 KAUST. We observe good "weak" and "strong" scaling for Tarang on these 

systems. 
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^") ■ 1. Introduction 

en . 

f^ , A typical fluid flow is random or chaotic in the turbulent and instability 

CNJ ' regimes. Therefore we need to employ accurate numerical schemes for simulat- 

ing such flows. A pseudospectral algorithm [l|, [2f is one of the most accurate 
methods for solving fluid flows, and it is employed for performing direct numeri- 
cal simulations of turbulent flows, as well as for critical applications like weather 
r% ■ predictions and climate modelling. Yokokawa et al. [3, |4(, Donzis et al. [5[, and 

C$ , Pouquet et al. [6[ have performed spectral simulations on some of the largest 

grids (e.g., 4096 3 ). 

We have developed a general-purpose flow solver named Tarang (synonym for 
waves in Sanskrit) for turbulence and instability studies. Tarang is a parallel and 
modular code written in object-oriented language C++. Using Tarang, we can 
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solve incompressible flows involving pure fluid, Rayleigh-Benard convection, pas- 
sive and active scalars, magnctohydrodynamics, liquid-metals, etc. Tarang is an 



open-source code and it can be downloaded from http: //turbulence .phy . iitk.ac . in 



In this paper we will describe some details of the code, scaling results, and code 
validation performed on Tarang. 

2. Salient features of Tarang 

The basic steps of Tarang follow the standard procedure of pscudospectral 
method [l|, [2j . The Navier-Stokes and related equations are numerically solved 
given an initial condition of the fields. The fields are time-stepped using one 
of the time integrators. The nonlinear terms, e.g. u • Vu, transform to con- 
volutions in the spectral space, which are very expensive to compute. Orszag 
devised a clever scheme to compute the convolution in an efficient manner using 
Fast Fourier Transforms (FFT) [l|,|2|]. In this scheme, the fields are transformed 
from the Fourier space to the real space, multiplied with each other, and then 
transformed back to the Fourier space. Note that the spectral transforms could 
involve Fourier functions, sines and cosines, Chebyshcv polynomials, spherical 
harmonics, or a combination of these functions depending on the boundary con- 
ditions. For details the reader is referred to standard references, e.g., the books 
by Boyd [l( and Canuto et al. Q • Some of the specific choices made in Tarang 
are as follow: 

(a) In the turbulent regime, the two relevant time scales, the large-eddy turnover 
time and the small-scale viscous time, are very different (order of magnetic 
apart). To handle this feature, we use the "exponential trick" that absorbs 
the viscous term using a change of variable [2J . 

(b) We use the fourth-order Runge-Kutta scheme for time stepping. The code 
however has an option to use the Euler and the second-order Runge-Kutta 
schemes as well. 

(c) The code provides an option for dealiasing the fields. The 3/2 rule is used 
for dealiasing [2(. 

(d) The wavenumber components ki are 

2ir 
h = —ni (1) 

where Li is the box dimension in the z-th direction, and ni is an integer. 
We use parameters 

2-7T 

kfactor.; = — (2) 

L>i 

to control the box size, especially for Rayleigh-Benard convection. Note 
that typical spectral codes take kfactor^ = 1, or fcj = m. 

The parallel implementation of Tarang involved parallclization of the spec- 
tral transforms and the input-output operations, as described below. 



3. Parallelization Strategy 

A pscudospectral code involves forward and inverse transforms between the 
spectral and real space. In a typical pscudospectral code, these operations take 
approximately 80% of the total time. Therefore, we use one of the most efficient 
parallel FFT routines, FFTW (Fastest Fourier Transform in the West) Q, in 
Tarang. We adopt FFTW's strategy for dividing the arrays. If p is the number 
of available processors, we divide each of the arrays into p "slabs" . For exam- 
ple, a complex array A(N 1 ,N 2 ,N 3 /2 + 1) is split into A(N 1 /p,N 2 ,N 3 /2 + 1) 
segments, each of which is handled by a single processor. This division is called 
"slab decomposition" . The other time-consuming tasks in Tarang are the input 
and output (I/O) operations of large data sets, and the elcmcnt-by-element mul- 
tiplication of arrays. The data sets in Tarang arc massive, for example, the data 
size of a 4096 3 fluid simulation is of the order of 1.5 terabytes. For I/O opera- 
tions, we use an efficient and parallel library named HDF5 (Hierarchical Data 
Format-5). The third operation, element-by-element multiplication of arrays, is 
handled by individual processors in a straightforward manner. 

Tarang has been organized in a modular fashion, so the spectral transforms 
and I/O operations were easily parallelized. For a periodic-box, we use the 
parallel FFTW library itself. However, for the mixed transforms (e.g., sine 
transform along x, and Fourier transform along yz plane), we parallelize the 
transforms ourselves using one- and two-dimensional FFTW transforms. 

An important aspect of any parallel simulation code is its scalability. We 
tested the scaling of FFTW and Tarang by performing simulations on 1024 3 , 
2048 3 , and 4096 3 grids with variable number of processors. The simulations 
were performed on the HPC system of IIT Kanpur and Shaheen supercomputer 
of King Abdullah University of Science and Technology (KAUST). The HPC 
system has 368 compute nodes connected via a 40 Gbps Qlogic Infiniband switch 
with each node containing dual Intel Xcon Quadcorc C5570 processor and 48GB 
of RAM. Its peak performance (Rpeak) is approximately 34 teraflops (tera float- 
ing point operations per second). Shaheen on the other hand is a 16-rack IBM 
BlueGene/P system with 65536 cores and 65536 GB of RAM. Shaheeri's peak 
performance is approximately 222 teraflops. 

For parallel FFT with slab decomposition, we compute the time taken per 
step (forward+backward transform) on Shaheen for several large A 3 grids. The 
results displayed in Fig. Q] demonstrate an approximate linear scaling (called 
"strong scaling" ) . Using the fact that each forward plus inverse FFT involves 
5 A 3 log A 3 operations for single precision computations [7| , the average FFT 
performance per core on Shaheen is approximately 0.3 gigaflops, which is only 
8% of its peak performance. Similar efficiency is observed for the HPC system as 
well, whose cores have rating of approximately 12 gigaflops. The aforementioned 
loss of efficiency is consistent with the other FFT libraries, e.g, p3dfft |8(. Also 
note that an increase in the data size and number of processors (resources) by 
a same amount takes approximately the same time (see Fig. [1]). For example, 
FFT of a 1024 3 array using 128 processors, as well as that of a 2048 3 array on 
1024 processors, takes approximately 4 seconds. Thus our implementation of 
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Figure 1: Scaling of parallel FFT on Shaheen for 1024 J , 2048 J and 4096 3 grids with single 
precision computation. The straight lines represent the ideal linear scaling. 



FFT shows good "weak scaling" as well. 

We also test the scaling of Tarang on Shaheen and the HPC system. Figs. [2] 
and[3]cxhibit the scaling results of fluid simulations performed on these systems. 
Fig. @] shows the scaling results for magnetohydrodynamics (MHD) simulation 
on Shaheen. These plots demonstrate strong scaling of Tarang, consistent with 
the aforementioned FFT scaling. Sometimes we observe a small loss of efficiency 
when N = p. We also observe approximate weak scaling for Tarang on both 
Shaheen and the HPC system. 

A critical limitation of the "slab decomposition" is that the number of pro- 
cessor cannot be more than N\ . This limitation can be overcome in a new scheme 
called "pencil decomposition" in which the array A(Ni,N2,Ns/2 + 1) is split 
into A(Ni/pT,N2/p2,N^/2 + 1) pencils where the total number of processors 
p = Pi x p 2 [8[ . We are in the process of implementing "pencil decomposition" 
on Tarang. In this paper we will focus only on the "slab decomposition" . 

After the above discussion on parallelization of the code, we will discuss code 
validation, and time and space complexities for simulations of fluid turbulence, 
Rayleigh-Benard convection, and magnetohydrodynamic turbulence. 



4. Fluid turbulence 

The governing equations for incompressible fluid turbulence are 

<9 t u+(u-V)u = -Vp + ;A7 2 u + F", 
V • u = 0, 



(3) 
(1) 
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Figure 2: Scaling of Tarang's fluid solver on Shaheen for 1024 3 and 2048 3 grids with single 
precision computation. The straight lines represent the ideal linear scaling. 
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Figure 3: Scaling of Tarang's fluid solver on the HPC system of IIT Kanpur for 1024 3 , 2048 3 , 
and 4096 3 grids with single precision computation. The straight lines represent the ideal linear 
scaling. 
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Figure 4: Scaling of Tarang's magnetohydrodynamic (MHD) solver on Shaheen for 1024 3 and 
2048 3 grids with single precision computation. The straight lines represent the ideal linear 
scaling. 



where u is the velocity field, p is the pressure field, v is the kinematic viscosity, 
and F" is the external forcing. For studies on homogeneous and isotropic tur- 
bulence, simulations are performed on high-resolution grids (e.g, 2048 3 ,4096 3 ) 
with a periodic boundary condition. The resolution requirement is stringent 
due to N ~ Re 3 / 4 relation; for Re = 10 5 , the required grid resolution is approx- 
imately 5600 3 , which is quite challenging even for modern supercomputers. 

Regarding the space complexity of a forced fluid turbulence simulation, 
Tarang requires 15 arrays (for u(k), u(r),F u (k), nlin(k), and three temporary 
arrays), which translates to approximately 120 gigabytes (8 terabytes) of mem- 
ory for 1024 3 (4096 3 ) double-precision computations. Here k and r represent the 
wavenumbers and the real space coordinates respectively. The requirement is 
halved for a simulation with single precision. Regarding the time requirement, 
each numerical step of the fourth-order Runge-Kutta (RK4) scheme requires 
9x4 FFT operations. The factor 9 is due to the 3 inverse and 6 forward trans- 
forms performed for each of the four RK4 iterates. Therefore, for every time 
step, all the FFT operations require 36 x 2.5 x N 3 \og 2 (N 3 ) multiplications for a 
single precision simulation |7|, which translates to approximately 2.9 (185) tera 
floating-point operations for 1024 3 (4096 3 ) grids. The number of operations for 
double-precision computation is twice of the above estimate. On 128 processors 
on HPC system, a fluid simulation with single-precision takes approximately 
36 seconds (see Fig. [3{, which corresponds to per core performance of approx- 



imately 0.68 gigaflops. This is only 6% of the peak performance of the cores, 
which is consistent with the efficiency of FFT operations discussed in Section [3j 
Also note that the solver also involves other operations, e.g., element-by-elemcnt 
array multiplication, but these operations take only a small fraction of the total 
time. 

We can also estimate the total time required to perform a 4096 3 fluid sim- 
ulation. A typical fluid turbulence would require 5 eddy turnover time with 
dt « 5 x 10~ 4 , which corresponds to 10 4 time steps for the simulations. So 
the total floating point operations required for this single-precision simulation 
is 185 x 10 4 tera floating-point operations for the FFT itself. Assuming 5% effi- 
ciency for FFT, and FFTs share being 80% of the total time, the aforementioned 
fluid simulation will take approximately 128 hours on a 100 tcraflop cluster. 

We perform code validation of the fluid solver using Kolmgorov's theory [9[ 
for the third-order structure function, according to which 



Si l (r) = ({«|.(x + r)- 



|(x)} 3 ) = --er 



(5) 



where e is the energy flux in the inertial range, and (...) represents ensemble 
averaging (here spatial averaging) . We compute the structure function S% (r) , as 
well as S§ (r), S 7 (r), and S$ (r) for the steady-state dataset of a fluid simulation 
on a 1024 3 grid. The computed values of S q (r) are illustrated in Fig. [5] that 
shows a good agreement with Kolmogorov's theory. 




Figure 5: Plots of the normalized odd-order structure functions — S„(r)/(er) n ' 3 vs. r/rj for a 
fluid simulation using Tarang. Here e is the energy flux, and rj is the Kolmogorov scale. 



After the discussion on fluid solver, we move on to the module for solving 
Rayleigh-Benard convection. 



5. Rayleigh-Benard convection 

Raylcigh-Bcnard convection (RBC) is an idealized model of convection in 
which fluid is subjected between two plates that are separated by a distance 
d, and are maintained at temperatures To and To — A. The equations for the 
above fluid under Boussinesq approximations are 



<9 t u+(u-V)u = 


V<7 2 

= h aqOz + vV u, 

Pa 


(6) 


d t e + (u • v)0 = 


A o 
= -u z + kV 2 6>, 
a 


(7) 


Vu = 


-- 


(8) 



where 9 and a are the temperature and pressure fluctuations from the steady 
conduction state (T = T c + with T c as the conduction temperature profile), 
z is the buoyancy direction, A is the temperature difference between the two 
plates, v is the kinematic viscosity, and k is the thermal diffusivity. We solve the 
nondimensionalized equations, which are obtained using d as the length scale, 
n/d as the velocity scale, and A as the temperature scale: 

(711 

— + (u-V)u = -Vct + RPBz + TV 2 u, (9) 

f)0 

— + (u-V)0 = u 3 + \7 2 9. (10) 

ot 

Here the two important nondimensional parameters are the Rayleigh number 
R = agAd 3 /vk, and the Prandtl number P = v / ' k. In Tarang we can apply the 
free-slip boundary condition for the velocity fields at the horizontal plates, i.e., 

u 3 = d z U\ = d z u 2 = for z = 0, 1, (11) 

and isothermal boundary condition on the horizontal plates 

6 = for z = 0, 1, (12) 

Periodic boundary conditions are applied to the vertical boundaries. 

The number of arrays required for a RBC simulation is 18 (15 for fluids plus 
three for 0(k),6'(r), nlin ). Thus the memory requirement for RBC is (18/15) 
times that for the fluid simulation. Regarding the time complexity, the number 
of FFT operations required per time step is 13 x 4 FFT operations (4 inverse 
+ 9 forward transforms per RK4 step). As a result, the total time requirement 
for a RBC simulation is (13/9) times the respective fluid simulation. 

For code validation of Tarang's RBC solver, we compare the Nusselt number 
Nu = 1 + (u 3 9) computed using Tarang with that computed by Thual [10| for 
two-dimensional free-slip box. The analysis is performed for the steady-state 
dataset. The comparative results shown in Table [1] illustrate excellent agree- 
ment between the two runs. We also compute the Nusselt number for a three- 
dimensional flow with Pr = 6.8 and observe that Nu = (0.27±0.04)(Pri?a)°- 27±001 



ll| . which is in good agreement with earlier experimental and numerical results. 



Table 1: Verification of Tarang against Thual's tlOII 2D RBC simulations. We compare Nussclt 
numbers (Nu) computed in our simulations on a 64 2 grid against Thual's simulations on 16 2 
(THU1), 32 2 (THU2), and 64 2 (THU3) grids. All Nu values tabulated here arc for the Prandtl 
number of 6.8. 
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THU1 


THU2 


THU3 


Tarang 


2 


2.142 


- 


- 


2.142 


3 


2.678 


- 


- 


2.678 


4 


3.040 


3.040 


- 


3.040 


6 


3.553 


3.553 


- 


3.553 


10 


4.247 


4.244 


- 


4.243 


20 


5.363 


5.333 


5.333 


5.333 


30 


6.173 


6.105 


6.105 


6.105 


40 


6.848 


6.742 


6.740 


6.740 


50 


7.441 


7.298 


7.295 


7.295 



Using the RBC module of Tarang, we also studied the energy spectra and 



fluxes of the velocity and temperature fields |l_2j , the Nusselt number scaling |ll| , 
and chaos and bifurcations near the onset of convection [13L Il4| . 

In the next section we will discuss the results of the MHD module of Tarang. 

6. Magnetohydrodynamic turbulence and dynamo 

The equations for the incompressible MHD turbulence [15| are 

<9 t u+(u-V)u = -Vp+(B- V)B + iA7 2 u + F u , (13) 

<9 t B + (u-V)B = (B- V)u + r ? V 2 B + F s , (14) 

Vu = VB = 0, (15) 

where u, B and p are the velocity-, magnetic-, and pressure (thermal+magnetic) 
fields respectively, v is the kinematic viscosity, and r\ is the magnetic diffusivity. 
The F u and F B are external forcing terms for the velocity and magnetic fields 
respectively. Typically, F B = 0, but Tarang implements F B for generality. 
The magnetic field B can be separated into its mean Bo and fluctuations b: 
B = Bo+b. The number of nonlinear terms in the above equations is four whose 
computation requires 27 FFTs. However, the number of FFT computations in 
terms of the Elsasser variables z ± = u ± b is only 15, thus saving significant 
computing time. We use 

(u • V)u - (B • V)B = (z~ • V)z+ + (z+ • V)z~ , (16) 

(u-V)B-(B-V)u = (z~ ■ V)z+ - (z+ ■ V)z~ (17) 

to compute the nonlinear terms. Thus, the time requirement for a MHD simu- 
lation would be around 15/9 times that for the fluid simulation. In Fig. |4] we 




Time 



Figure 6: Time evolution of total kinetic energy (top panel) and total magnetic energy (bottom 
panel) for a decaying MHD simulation with Taylor-Green vortex as an initial condition. Blue 
dots are Tarang's data points, while the solid lines are the lattice simulation result of Breyian- 
nis and Valougeorgis |16| . The three different curves reported here are for v = t] = 0.01, 0.05, 
0.1 from top to bottom. 
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plot the time taken per step for different set of processors on Shaheen. The 
results are consistent with the above estimates. Regarding the space complex- 
ity, an MHD simulation requires 27 arrays for storing u(k),B(k),B(r),u(r), 
F u (k), F B (k), nlin u (k), nlin (k), and three temporary fields. Hence the mem- 
ory requirement for a MHD simulation is 27/15 times that of a fluid simulation. 
We perform code validation of Tarang's MHD module using the results of 
Brcyiannis and Valougcorgis's [16| lattice kinetic simulations of three-dimensional 
decaying MHD. Following Breyiannis and Valougeorgis, we solve the MHD equa- 
tions inside a cube with periodic boundary conditions on all directions, and with 
a Taylor-Green vortex (given below) as an initial condition, 

u = [sin(x) cos(y) cos(z), — cos(x) sin(y) cos(-z), 0] , (18) 

B = [sin(x) sin(j/) cos(z),cos(x) cos(y) cos(z), 0] . (19) 

This Taylor-Green vortex is then allowed to evolve freely. The simulation box 
is discretized using 32 3 grid points. 

The results of this test case for different parameter values (y = i] = 0.01, 
0.05, 0.1) are presented in Fig. [5] The top and bottom panels exhibit the time 
evolution of the total kinetic- and magnetic energies respectively. Tarang's data 
points, illustrated using blue dots, are in excellent agreement with Breyiannis 
and Valougeorgis' results [16[, which is represented using solid lines. We thus 
verify the MHD module of Tarang. 

We have used Tarang to perform extensive simulations of dynamo transition 



under the Taylor-Green forcing [17|,[18(. Using Tarang, we have also computed 



the magnetic and kinetic energy spectra, various energy fluxes |15| . and shell- 
to-shell energy transfers for MHD turbulence; these results would be presented 
in a subsequent paper. 

In addition to the fluid, MHD, and Raylcigh-Bcnard convection solvers, 
Tarang has modules for simulating rotating turbulence, passive and active scalars, 
liquid metal flows, rotating convection |19j, and Kolmogorov flow. 

7. Conclusions 

In this paper we describe salient features and code validation of Tarang. 
Tarang passes several validation tests performed for fluid, Raylcigh-Bcnard con- 
vection, and magnetohydrodynamic solvers. We also report scaling analysis of 
Tarang and show that it exhibits excellent strong- and weak scaling up to sev- 
eral thousand processors. Tarang has been used for studying Rayleigh-Benard 
convection, dynamo, and magnetohydrodynamic turbulence. It has been ported 
to various computing platforms including the HPC system of IIT Kanpur, Sha- 
heen of KAUST, Param Yuva of the Centre for Advanced Computing (Punc), 
and EKA of the Computational Research Laboratory (Punc). 
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